arXiv:1501.03771vl [cs.CV] 15 Jan 2015 


1 


Submodular relaxation for inference in Markov 

random fields 

Anton Osokin Dmitry Vetrov 


Abstract —In this paper we address the problem of finding the most probable state of a discrete Markov random field (MRF), 
also known as the MRF energy minimization problem. The task is known to be NP-hard in general and its practical importance 
motivates numerous approximate algorithms. We propose a submodular relaxation approach (SMR) based on a Lagrangian 
relaxation of the initial problem. Unlike the dual decomposition approach of Komodakis et al. [30] SMR does not decompose the 
graph structure of the initial problem but constructs a submodular energy that is minimized within the Lagrangian relaxation. Our 
approach is applicable to both pairwise and high-order MRFs and allows to take into account global potentials of certain types. 
We study theoretical properties of the proposed approach and evaluate it experimentally. 

Index Terms —Markov random fields, energy minimization, combinatorial algorithms, relaxation, graph cuts 
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1 Introduction 

The problem of inference in a Markov random field 
(MRF) arises in many applied domains, e.g. in ma¬ 
chine learning, computer vision, natural language 
processing, etc. In this paper we focus on one impor¬ 
tant type of inference: maximum a posteriori (MAP) 
inference, often referred to as MRF energy minimiza¬ 
tion. Inference of this type is a combinatorial opti¬ 
mization problem, i.e. an optimization problem with 
the finite domain. 

The most studied case of MRF energy minimization 
is the situation when the energy can be represented as 
a sum of terms (potentials) that depend on only one 
or two variables each (unary and pairwise potentials). 
In this setting the energy is said to be defined by a 
graph where the nodes correspond to the variables 
and the edges to the pairwise potentials. Minimization 
of energies defined on graphs in known to be NP-hard 
in general [8] but can be done exactly in polynomial 
time in a number of special cases, e.g. if the graph 
defining the energy is acyclic [36] or if the energy is 
submodular in standard [28] or multi-label sense [10]. 

One way to go beyond pairwise potentials is to add 
higher-order summands to the energy. For example, 
Kohli et al. [23] and Ladicky et al. [32] use high-order 
potentials based on superpixels (image regions) for 
semantic image segmentation; Delong et al. [11] use 
label cost potentials for geometric model fitting tasks. 

To be tractable, high-order potentials need to have a 
compact representation. The recently proposed sparse 
high-order potentials define specific values for a short 
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list of preferred configurations and the default value 
for all the others [29], [37]. 

In this paper we develop the submodular relaxation 
framework (SMR) for approximate minimization of 
energies with pairwise and sparse high-order poten¬ 
tials. SMR is based on a Lagrangian relaxation of con¬ 
sistency constraints on labelings, i.e. constraints that 
make each node to have exactly one label assigned. 
The SMR Lagrangian can be viewed at as a pseudo- 
Boolean function of binary indicator variables and is 
often submodular, thus can be efficiently minimized 
with a max-flow/min-cut algorithm (graph cut) at 
each step of the Lagrangian relaxation process. 

In this paper we explore the SMR framework the¬ 
oretically and provide explicit expressions for the ob¬ 
tained lower bounds. We experimentally compare our 
approach against several baselines (decomposition- 
based approaches [30], [25], [29]) and show its appli¬ 
cability on a number of real-world tasks. 

The rest of the paper is organized as follows. In 
sec. 2 we review the related works. In sec. 3 we 
introduce our notation and discuss some well-known 
results. In sec. 4 we present the SMR framework. 
In sec. 5 we construct several algorithms within the 
SMR framework. In sec. 6 we present the theoretical 
analysis of our approach and in sec. 7 its experimental 
evaluation. We conclude in sec. 8. 

2 Related works 

We mention works related to ours in two ways. First, 
we discuss methods that analogously to SMR rely 
on multiple calls of graph cuts. Second, we mention 
some decomposition methods based on Lagrangian 
relaxation. 

Multiple graph cuts. The SMR framework uses the 
one-versus-all encoding for the variables, i.e. the indi¬ 
cators that particular labels are assigned to particular 
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variables, and runs the graph cut at the inner step. In 
this sense our method is similar to the o-expansion 
algorithm [8] (and its generalizations [22], [23]) and 
the methods for minimizing multi-label submodular 
functions [14], [10]. Nevertheless SMR is quite dif¬ 
ferent from these approaches. The o-expansion algo¬ 
rithm is a move-making method where each move 
decreases the energy. Methods of [14] and [10] are 
exacf methods applicable only when a total order on 
the set of labels is defined and fhe pairwise potentials 
are convex w.r.t. it (or more generally submodular 
in multi-label sense). Note that the frequently used 
Potts model is not submodular when the number of 
possible labels is larger than 2, so the methods of [14] 
and [10] are nof applicable to it. 

Jegelka and Bilmes [16], Kohli et al. [24] studied 
the high-order potentials that group lower-order po¬ 
tentials together (cooperative cuts). Similarly to SMR 
cooperative cut methods rely on calling graph cuts 
multiple times. As opposed to SMR these methods 
construct and minimize upper (not lower) bounds on 
the initial energy. 

The QPBO method [5] is a popular way of con¬ 
structing the lower bound on the global minimum 
using graph cuts. When the energy depends on binary 
variables the SMR approach obtains exactly the same 
lower bound as QPBO. 

Decomposition methods split the original hard prob¬ 
lem into easier solvable problems and try to make 
them reach an agreement on their solutions. One 
advantage of fhese methods is the ability not only 
to estimate the solution but also to provide a lower 
bound on the global minimum. Obtaining a lower 
bound is important because it gives an idea on how 
much better the solution could possibly be. If the 
lower bound is tight (the obtained energy equals 
the obtained lower bound) the method provides a 
certificate of optimality, i.e. ensures that the obtained 
solution is optimal. The t 5 rpical drawback of such 
methods is a higher computational cost in comparison 
with e.g. the n-expansion algorithm [18]. 

Usually a decomposition method can be described 
by the two features: a way to enforce agreement of the 
subproblems and type of subproblems. We are aware 
of fwo ways fo enforce agreemenf: message passing 
and Lagrangian relaxafion. Message passing was used 
in this context by Wainwright et al. [43], Kolmogorov 
[25] in the tree-reweighted message passing (TRW) 
algorithms. The Lagrangian relaxation was applied to 
energy minimization by Komodakis et al. [30] and 
became very popular because of its flexibility. The 
SMR approach uses the Lagrangian relaxation path. 

The majority of mefhods based on fhe Lagrangian 
relaxation decompose the graph defining fhe problem 
info subgraphs of simple structure. The usual choice 
is to use acyclic graphs (trees) or, more generally, 
graphs of low treewidfh [30], [2]. Komodakis et al. [30] 
pointed out the alternative option of using submod¬ 


ular subproblems buf only in case of problems wifh 
binary variables and fogefher wifh fhe decomposition 
of the graph. Strandmark and Kahl [42] used decom¬ 
position into submodular subproblems to construct 
a distributed max-flow/min-cut algorithm for large- 
scale binary submodular problems. Komodakis and 
Paragios [29] generalized the method of [30] fo work 
with the energies with sparse high-order potentials. 
In this method groups of high-order pofentials were 
puf info separafe subproblems. 

Several times the Lagrangian relaxation approach 
was applied to the task of energy minimization with¬ 
out decomposing the graph of fhe problem. Yarkony 
ef al. [46] created several copies of each node and 
enforced copies fo take equal values. Yarkony et al. 
[47] introduced planar binary subproblems in one- 
versus-all way (similarly to SMR) and used them in 
the scheme of tighfening the relaxation. 

The SMR approach is similar to the methods of [2], 
[29], [30], [42], [46], [47] in a sense fhat all of fhem 
use the Lagrangian relaxation to reach an agreement. 
The main advantage of fhe SMR is thaf the number 
of dual variables does not depend on the number of 
labels and the number of high-order potentials but is 
fixed fo fhe number of nodes. This structure results in 
the speed improvements over the competitors. 

3 Notation and preliminaries 

Notation. Let V be a finite set of variables. Each 
variable i G V takes values Xi G V where V is 
a finite set of labels. For any subsef of variables 
U C V the Cartesian product is the joint domain 
of variables C. A joint state (or a labeling) of variables 
C is a mapping xc : C ^ V and xc{i), i G C is fhe 
image of i under xq- We use to denote a set of all 
possible joint states of variables in C} A joint state of 
all variables in V is denoted by x (instead of xy)- If 
an expression confains bofh xq G and xa G V^, 
A, C CV we assume fhat the mappings are consistent, 
i.e. yi G A n C : xc{i) = XA{i)- 

Consider h 5 q)ergraph (V, C) where C C 2^ is a set of 
h 5 rperedges. A potential of hyperedge C is a mapping 
9c '■ —>■ R and a number 9c{xc) S R is fhe value 

of fhe potential on labeling xc- The order of pofential 
9c refers to the cardinality of sef C and is denofed by 
\C\. Pofentials of order 1 and 2 are called unary and 
pairwise correspondingly. Potentials of order higher 
fhan 2 are called high-order. 

An energy defined on hypergraph (V,C) (factoriz- 
able according to h 5 q)ergraph (V, C)) is a mapping 
E : V'^ -G R that can be represented as a sum of 
potentials of h 5 q)eredges C G C: 

E{x) = ^ 9c{xc). (1) 

cec 

1 . Note that pC jg highly related to 7^1 Cl "mapping 

terminology" is introduced to allow original variables i S C C V 
to be used as arguments of labelings: xc{i). 
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It is convenient to use notational shortcuts related 
to potentials and variables: ... ,Xi^) = 

= Oc,xc = Oc{xc), xc{i) = xc,i where 
C = {ii,..., ik} and i € C. 

When an energy is factorizable to unary and pair¬ 
wise potentials only we call it pairwise. We write 
pairwise energies as follows: 

E{x) = '^ei{xi) + ^ 9ij{xi,Xj). (2) 
ieV {i,j}GS 

Here f is a set of edges, i.e. hyperedges of order 2. 
Under this notation set C contains hyperedges of order 
1 and 2: C = {{i} | i G V} U {{i,j} \ {i,j} G S}. 

Sometimes it will be convenient to use indicators of 
each variable taking each value: yip = [xi = p\, i G V, 
p GV. Indicators allow us to rewrite the energy (1) as 

Eiiy) = E E Oc,dW_y%di. (3) 

iec 

Minimizing multilabel energy (1) over is equiva¬ 
lent to minimizing binary energy (3) over the Boolean 
cube {0, under so-called consistency constraints: 

= VfGV. (4) 

pev 

Throughout this paper, we use equation numbers 
to indicate a constraint given by the corresponding 
equation, for example y G (4). 

The Lagrangian relaxation approach. A popular ap¬ 
proximate approach to minimize energy (1) is to en¬ 
large the variables' domain, i.e. perform the relaxation, 
and to construct the dual problem to the relaxed one 
as it might be easier to solve. Formally, we get 

maxI?(A) < mmEib{y) < mini?(a;) (5) 

A y&y x^X 

where X and y are discrete and continuous sets, 
respectively, where X c y. En,{y) is a lower bound 
function defined on set y and D{\) a dual function. 
Both inequalities in (5) can be strict. If the left one is 
strict we say that there is a non-zero duality gap, if the 
right one is strict - there is a non-zero integrality gap. 
Note that if the middle problem in (5) is convex and 
one of the regularity conditions is satisfied (e.g. it is 
a linear program) there is no duality gap. The goal of 
all relaxation methods is to make the joint gap in (5) 
as small as possible. In this paper we will construct 
the dual directly from the discrete primal and will 
need the continuous primal only for the theoretical 
analysis. 

The linear relaxation of a pairwise energy consists 
in converting the energy minimization problem to an 
equivalent integer linear program (ILP) and in relaxing 
it to a linear program (LP) [40], [43]. 

Wainwright et al. [43] have shown that minimizing 
energy (2) over the discrete domain is equivalent to 
minimizing the linear function 

ELiyL) = J2J2 ^ipVip E E ^ij,pqyij,pq (^) 

icv pCV Ptq&V 


w.r.t. unary yip > 0 and pairwise yij,pq > 0 variables 
over the so-called marginal polytope: a linear polytope 
with facets defined by all marginalization constraints. 
The marginal polytope has an exponential number of 
facets, thus minimizing (6) over it is intractable. A 
polytope defined using only the first-order marginal¬ 
ization constraints is called a local polytope: 

= ViGV, (7) 

pev 

^ ^ yij,pq~yjqi ^ ^ yij,pq~yipi Vp,^ gt^. 

pGP q&V 

( 8 ) 

We refer to constraints (7) as consistency constraints and 
to constraints (8) as marginalization constraints. 

Minimizing (6) over local polytope (7)-(8) is often 
called a Schlesinger's LP relaxation [40] (see e.g. [44] 
for a recent review). In this paper we refer to this 
relaxation as the standard LP relaxation. 

4 SUBMODULAR RELAXATION 

In this section we present our approach. We start with 
pairwise MRFs (sec. 4.1) [35], generalize to high-order 
potentials (sec. 4.2) and to linear global constraints 
(sec. 4.3). A version of SMR for nonsubmodular La- 
grangians is discussed in the supplementary material. 

4.1 Pairwise MRFs 

The indicator notation allows us to rewrite the min¬ 
imization of pairwise energy (2) as a constrained 
optimization problem: 

min EE ^ipUip “1“ E E Qij.pqyipyjqi (9) 

^ iCV pCP {i,j}CS p.qCP 

s.t. Pip G {0,1}, yiGV,pGV, (10) 

^2/^P = l, ViGV. (11) 

pCV 

Constraints (51) (i.e. (4)) guarantee each node i to 
have exactly one label assigned, constraints (50) fix 
variables yip to be binary. 

The target function (49) (denoted by Ei{y)) of bi¬ 
nary variables y (i.e. (50) holds) is a pseudo-Boolean 
polynomial of degree at most 2. This function is easy 
to minimize (ignoring the consistency constraints (51)) 
if it is submodular, which is equivalent to having 
< 0 for all edges {i, j} G £ and all labels p,q gV. 
The so-called associative potentials are examples of 
potentials that satisfy these constraints. A pairwise 
potential 9ij is associative if it rewards connected nodes 
for taking the same label, i.e. 

% (P, q) = -Cij,p [p = q] = I ^ E’ 

1^0, p^q. 

where > 0. Note that if constants Cij^p are 

independent of label index p (i.e. Cij^p = Cij) then 
potentials (12) are equivalent to the Potts potentials. 
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Osokin et al. [35] observe that when Ej{y) is 
submodular consistency constraints (51) only prevent 
problem (49)-(51) from being solvable, i.e. if we drop 
or perform the Lagrangian relaxation of them we 
obtain an easily computable lower bound. In other 
words we can look at the following Lagrangian: 

(13) 

iev ^ 

As a function of binary variables y Lagrangian L{y, A) 
differs from Ej{y) only by the presence of unary 
potentials Xiyip and a constant, so Lagrangian L{y, A) 
is submodular whenever Ei{y) is submodular. 

Applying the Lagrangian relaxation to the prob¬ 
lem (49)-(51) we bound the solution from below 


min Ejiy) = min maxL('y,A) > 
yG(50),(51) yG(50) A 

max min L(y,\) = maxi9(A) (14) 

A yG{50) A 

where D{X) is a Lagrangian dual function: 

D{X) = min (Ei{y) + EE ^iVip )-E A,. (15) 

A p^-pi^y / igv 

The dual function D{X) is a minimum of a finite 
number of linear functions and thus is concave and 
piecewise-linear (non-smooth). A subgradient g G 
of the dual D{X) can be computed given the 
result of the inner pseudo-Boolean minimization: 

= E y^p ^ 

pen 


where {i/*,}f|v = argmin( L;/(y) -h E E A*J/*p 
y&(50) \ pen iev 

The concavity of D{X) together with the ability 
to compute the subgradient makes it possible to 
tackle the problem of finding the best possible lower 
bound in family (14) (i.e. the maximization of D{X)) 
by convex optimization techniques. The optimization 
methods are discussed in sec. 5. 

Note, that if at some point A the subgradient given 
by eq. (16) equals zero it immediately follows that 
the inequality in (14) becomes equality, i.e. there is 
no integrality gap and the certificate of optimality is 
provided. This effect happens because there exists a 
labeling y* delivering the minimum of L{y, A) w.r.t. y 
and satisfying the consistency constraints (51). 

In a special case when all the pairwise potentials 
are associative (of form (12)) minimization of the 
Lagrangian w.r.t. y decomposes into \V\ subproblems 
each containing only variables indexed by a particular 
label p: 


min L(y, A) 
2 /e( 50 ) ^ 


per 


min 

2/p6{0.1}l^l 


y^’pivpiX) 'y Xj 

iev 


where ^p{yp^ A) — E (^ip 1- Xfyip E Eij^pyipyjp. 
iev 

In this case the minimization tasks w.r.t yp can be 
solved independently, e.g. in parallel. 


4.2 High-order MRFs 

Consider a problem of minimizing the energy con¬ 
sisting of the high-order potentials (1). The uncon¬ 
strained minimization of energy E{x) over multi-label 
variables x is equivalent to the minimization of func¬ 
tion Ej{y) (3) over variables y under the consistency 
constraints (51) and the discretization constraints (50). 

For now let us assume that all the high-order po¬ 
tentials are pattern-based and sparse [37], [29], i.e. 


0c{xc) 


yc,d, if xc = d G Vc, 
0, otherwise. 


(17) 


Here set Vc contains some selected labelings of vari¬ 
ables Xc and all the values of the potential are non¬ 
positive, i.e. 6c,d < 0, d G Vc- In case of sparse 
potential the cardinality of set Vc is much smaller 
compared to the number of all possible labelings of 
variables xc, i.e \Vc\ ^ 

In this setting we can use the identity 

(- Dree yt) = min^Gto.i} (dCI - 1)^ - EreC 2/^^) 

to perform a reduction^ of the high-order func¬ 
tion Ei{y) to a pairwise function in such a way that 
minye{o,i}* Ei{y) = mm2G{o,i}*. 3/e{o,i}* Epy, z) 
where by {0,1}* we denote the Boolean cubes of 
appropriate sizes and 


EPy,z)=-Y, E ^cM\C\-l)zc,d- 

CeCdGVc ^ . 

'^yed(Zc,dj- (18) 

eec ' 

Function Ej {y, z) is pairwise and submodular w.r.t. 
binary variables y and 2 : and thus in the absence of 
additional constraints can be efficiently minimized by 
graph cuts [28]. Adding and relaxing constraints (51) 
to the minimization of (3) gives us the following 
Lagrangian dual: 


D(X) = min L{y,z,X) = 
j/.ze{o,i}* 


mm 

i/.ze{o,i}* 


E*i (y, 2 ) + ^ aJ ^ - 1 

iev \&v 


(19) 


For all A Lagrangian L{y, z,X) is submodular w.r.t. 
y and 2 allowing us to efficiently evaluate D at 
any point. Function D{X) is a lower bound on the 
global minimum of energy (1) and is concave but non¬ 
smooth as a function of A. Analogously to (15) the 
best possible lower bound can be found using convex 
optimization techniques. 

Robust sparse pattern-based potentials. We can gen¬ 
eralize the SMR approach to the case of robust sparse 
high-order potentials using the ideas of Kohli et al. [23] 


2. This transformation of binary high-order function (3) is in fact 
equivalent to a special case of Type-II binary transformation of [37], 
but in this form it was proposed much earlier (see e.g. [15] for a 
review). 
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and Rother et al. [37]. Robust sparse high-order po¬ 
tentials in addition to rewarding the specific labelings 
d € Vc also reward labelings where the deviation 
from d G Vc is small. We can formulate such a term 
for hyperedge C in the following way: 

Oc{xc) = X! minfo, 0c,d Wi[xt^diU (20) 

deUc ^ 

or equivalently in terms of indicator variables y 

Ocivc) = ™nfo, 0c,d +'^wf {1 - yid,)] ■ (21) 
deuc ^ iec ' 

The irmer min of (21) can be transformed fo pairwise 
form via adding switching variables zc a (analogous 
to (18)): 

min 0 c,dZC,d + wf zc,d(l - 

^C,d 6 { 0 . 1 } ^ 


This function is submodular w.r.t. variables y and 2 : 
if coefficients w‘^ are normegative. Introducing La¬ 
grange multipliers and maximizing the dual function 
is absolutely analogous to the case of non-robust 
sparse high-order pofentials. 

One particular instance of robust pattern-based 
potentials is the robust P”-Potts model [23]. Here 
the cardinality of sefs Vc equals fhe number of the 
labels \V\ and all the selected labelings correspond to 
the uniform labelings of variables xc'- 


(^c{xc) = y] min| 0 , -7 -|- 
pev 


E 

iec 


-f/Q[xi^p]]. ( 22 ) 


Here 7 > 0 and Q G (0, |C|] are the model parameters. 
7 represents the reward that energy gains if all the 
variables xc take equal values. Q stands for the 
number of variables fhat are allowed fo deviate from 
fhe selecfed label p before the penalty is truncated.^ 
Positive values of 9c- Now we describe a simple way 
fo generalize our approach to the case of general high- 
order potentials, i.e. potentials of form (17) buf with 
values 0 c,d allowed to be positive. 

First, note that constraints (50) and (51) im¬ 
ply that for each hyperedge C G C equality 
n^ec y^pe = 1 holds. Because of fhis facf 
we can subfract fhe constant Me = max^eUc ^c{d) 
from all values of potential 0 c simultaneously without 
changing the minimum of the problem, i.e. 


min EAy) = min y 
ye(50),(51) yG(50),(51) ^ 


Me 


pe-pc 


Qc{p) n y^P!^ 

e&c 


where 9c{p) = dc{p) — Me- We have all 9c{p) non¬ 
positive and therefore can apply the SMR technique 


3. In the original formulation of the robust P"-Potts model [23] 
the minimum operation was used instead of the outer sum in (22). 
The two schemes are equivalent if the cost of the deviation from the 
selected labelings if high enough, i.e. Q < 0.5|C|. The substitution 
of minimum with summation in similar context was used by Rother 
et al. [37]. 


described at the beginning of section 4.2. Note, that 
in this case the number of variables 2 : required to 
add within the SMR approach can be exponential 
in the arity of h 5 q)er edge C. Therefore the SMR is 
practical when either the arity of all h 5 rperedges is 
small or when there exists a compact representation 
such as (17) or (20). 


4.3 Linear global constraints 

Linear constraints on the indicator variables form an 
important type of global constraints. The following 
constraints are special case of this class: 

7 Vjp = c, strict class size constraints, (23) 

E gv ^ [ci) C 2 ]) soft class size constraints, 

E^-ev ^ E^-ev equalities, 

E,gv E,gv y^p’ 

Here Ij is the observable scalar or vector value as¬ 
sociated with node j, e.g. intensity or color of the 
corresponding image pixel. There have been a number 
of works that separately consider constraints on total 
number of nodes that take a specific label (e.g. [31]). 

The SMR approach can easily handle general linear 
constraints such as 


E E ^^yip = c""’ m = 1 ,..., M, (24) 

ievpev 

J2J2<y^p^d\ k = i,...,K. (25) 

iGV peP 

We introduce additional Lagrange multipliers ^ = 
{Cm}m=i,...,M, tt = {Trk}k=i,...,K for constraints (24), 
(25) and get the Lagrangian 


L{y, \ I, tt) =Ei{y) + Ad 2 /ip - 1 

iGV \gV 


M 




m—l ^iGV pGV 


k=i ^iev pev 



where inequality multipliers tt are constrained to 
be normegative. The Lagrangian is submodular w.r.t. 
indicator variables y and thus dual function 


D{X, I, tt) = min L{y, A, tt) (26) 

J/G(50) 

can be efficiently evaluated. The dual function is 
piecewise linear and concave w.r.t. all variables and 
thus can be treated similarly to the dual functions 
introduced earlier. 



5 Algorithms 

In sec. 5.1 we discuss the convex optimization meth¬ 
ods that can be used to maximize the dual func¬ 
tions D{X) constructed within the SMR approach. In 
sec. 5.2 we develop an alternative approach based on 
coordinate-wise ascent. 

5.1 Convex optimization 

We consider several optimization methods that might 
be used to maximize the SMR dual functions: 

1) subgradient ascent methods [30]; 

2) bundle methods [17]; 

3) non-smooth version of L-BFGS [33]; 

4) proximal methods [48]; 

5) smoothing-based techniques [39], [48]; 

6 ) stochastic subgradient methods [30]. 

First, we begin with the methods that are not suited 
to SMR: proximal, smoothing-based and stochastic. 
The first two groups of methods are not directly appli¬ 
cable for SMR because they require not only the black¬ 
box oracle that evaluates the function and computes 
the subgradient (in SMR this is done by solving the 
max-flow problem) buf requesf addifional computa- 
fions: proximal mefhods - fhe prox-operator, log-sum- 
exp smoothing [39] - solving the sum-product prob¬ 
lem instead of max-sum, quadratic smoothing [48] 
requires explicit computation of convex conjugates.'^ 
We are not aware of possibilities of computing any of 
fhose in case of fhe SMR approach. 

Stochastic techniques have proven themselves to 
be highly efficient for large-scale machine learning 
problems. Bousquet and Boftou [ 6 , sec. 13.3.2] show 
fhat sfochasfic methods perform well within the 
estimation-optimization tradeoff but are not so great 
as optimization methods for the empirical risk (i.e. the 
optimized function). In case of SMR we do nof face 
fhe machine learning fradeoffs and thus stochastic 
subgradient is not suitable. 

Further in this section we give a short review of 
methods of the three groups (subgradient, bundle, L- 
BFGS) that require only a black-box oracle that for 
an arbitrary value of dual variables A compufes fhe 
function value D{\) and one subgradient dD{\). 

Subgradient methods at iteration n perform an 
update in the direction of fhe currenf subgradienf 

A"+i = A"-ha"(9D(A”). 

with a step size a". Komodakis et al. [30] and Kappes 
et al. [17] suggest the following adaptive scheme of 

4. Zach et al. [48] directly construct the smooth approximation /e 
of a convex non-smooth function / by /e = (/* -|-e|| ■ |||/2)* where 
e > 0 is a smoothing parameter and (•)* is the convex conjugate. 
In their case function / is a sum of summands of simple form for 
which analytical computation of convex conjugates is possible. In 
our case each summand is a min-cut LP, its conjugate is a max- 
flow LP. The addition of quadratic terms to /* makes fe quadratic 
instead of a min-cut LP and thus efficient max-flow algorithms are 
no more applicable. 


Input: Lagrangian L{y, A) (13), initialization Ag; 
Output: A* = argmax_)^ min^ L{y, A); 

1 : A := Aq; 

2 : repeat 

3: y* := argmin^^ L(y, A); // run max-Eow 

4: D(\)=L{y*\); fj evaluate the function 

5: for all i gV do 

6 : 9i//compute the subgradient 

pev 

7: if y = 0 then 

8 : return A; // the certificate of optimality 

9: else 

10 : estimate the primal; / / see sec. 5.3 

11 : choose stepsize a; // e.g. using rule (27) 

12 : A := A -I- ag; // make the step 

13: until convergence criteria are met 

14: return A; 

Fig. 1. Subgradient ascent aigorithm to maximize the 
SMR duai (15). 


selecting the step size: 

71" - i9(A») 

“ ^ wdDix-m 


(27) 


where 7l" is the current estimate of the optimum (the 
best value of the primal solution found so far) and 7 
is a parameter. Fig. 1 summarizes the overall method. 

Bundle methods were analyzed by Kappes et al. 
[17] in the context of MRF energy minimization. A 
bundle B is a collection of points A', values of dual 
function D{X'), and sungradients dD{X'). The next 
point is computed by the following rule: 


A"+i 


argmax 

A 





(28) 


where D{X) is the upper bound on the dual function: 
b{X)= min (£i(A')-F (dD(A'),A-A')). 


Parameter w" is intended to keep A"+^ close to 
the current solution estimate A. If the current step 
does not give a significant improvement than the 
null step is performed: bundle B is updated with 
(A"+^, Z1(A"+^), dIl(A"+^)). Otherwise the serious step 
is performed: in addition to updating the bundle 
the current estimate A is replaced by Kappes 

et al. [17] suggest to use the relative improvement of 
Z1(A"+^) and £)(A"+^) over D{X) (threshold tol) to 
choose the type of the step. Inverse step size w" can 
be chosen adaptively if the previous step was serious: 

n ^ p f ^ - maxfc^i^...,nT)(A ) \ 

V \\dD{X^)\\2 ) 

(29) 

where is the projection on the segment. 

In case of the null step w" = In summary, 

the bundle method has the following parameters: 7 , 
Wmin, Wniax/ niL, and maximum size of the bundle - 





7 


Input: Lagrangian L{y, A) (13), initialization Ag; 
Output: coordinate-wise maximum of D{\); 

1: A := Ao; 

2 : repeat 

3: converged := true; 

4: for all j G V do 

5: for a\\ p do 

6 : compute MMjpfiL{y,\), MMjp^iL{y, X); 

7: SI := MM,p,o L[y, A) - L(y, A); 

8 : find and 

9: if and 5 ^ 2 ) of the same sign then 

10: Xj := Xj + 0.5((5(^^ -I- i5(2)); 

11 : converged := false; 

12 : until converged ^ true; 

13: return A; 

Fig. 2. Coordinate ascent aigorithm for maximizing 
the duai D{X) (15) in case of pairwise associative 
potentiais. 


bg. Note that the update (28) implicitly estimates both 
the direction g” and the step size a”. 

Another algorithm analyzed by Kappes et al. [17] 
is the aggregated bundle method proposed by Kiwiel 
[19] with the same step-size rule (29). The method 
has the following parameters: 7 , Wmin/ Wmax/ and a 
threshold to choose null or serious step - irir- Please 
see works [17], [19] for details. We have also tried 
a combined strategy: LMBM method® [13] where we 
used only one non-default parameter: the maximum 
size of a bundle bg. 

L-BFGS methods choose the direction of the update 
using the history of function evaluations: S'"dZl(A") 
where S'" is a negative semi-definite estimate of the 
inverse Hessian at A". The step size a" is typically 
chosen via approximate one-dimensional optimiza¬ 
tion, a.k.a. line search. Lewis and Overton [33] pro¬ 
posed a specialized version of L-BFGS for non-smooth 
functions implemented in HANSO library.® In this 
code we varied only one parameter: the maximum 
rank of Hessian estimator hr- 

5.2 Coordinate-wise optimization 

An alternative approach to maximize the dual func¬ 
tion D{X) consists in selecting a group of variables 
and finding the maximum w.r.t them. This strategy 
was used in max-sum diffusion [44] and MPLP [12] 
algorithms. In this section we derive an analogous 
algorithm for the SMR in case of pairwise associative 
potentials. The algorithm is based on the concept of 
min-marginal averaging used in [43], [25] to optimize 
the TRW dual function. 

Consider some value A°*'^ of dual variables A and 
a selected node j € V. We will show how to construct 

5. http:/ /napsu.karmitsa.fi/lmbm/lmbmu/lmbm-mex.tar.gz 

6. http:/ /www.cs.nyu.edu/overton/software/hanso/ 


a new value A"'''™ such that i9(A"®"’) > D{X°^‘^). The 
new value will differ from the old value only in one 
coordinate Xj, i.e. A”®*" = Xf^, i ^ j. 

Consider the unary min-marginals of Lagrangian (13) 

= min L(y,A°'‘^) 
y6(50). Vjp=k 


where k € {0,1}, p € V. The min-marginals 

can be computed efficiently using the dynamic 
graphcut framework [21]. Denote the differences of 
min-marginals for labels 0 and 1 with 6^: S^j, = 
MMjppL{y,X°^^) - MM,p,iL(y,A°'‘^). Let be 
the maximum of 6^, 5 ^ 2 ) “ l^e next maximum; 
p^^^ and - the corresponding indices: pj^'^ = 

argmaxpgp (5^, (5^^^ = pf^ = argmax^g.p^^p(i) j <5^, 

Pj j 

( 5 ^ 2 ) = Let us construct the new value A"®™ of 
dual variable Xj with the following rule: 

Af “ = Xf‘^ + Aj (30) 


where Aj is a number from segment 


S^ S^ 


Theorem 1. Rule (30) when applied to a pairwise energy 
with associative pairwise potentials delivers the maximum 
of function D{X) w.r.t. variable Xj. 


The proof relies on the intuition that rule (30) di¬ 
rectly assigns the min-marginals MMjp^kL {y, A), p G 
V, k G {0,1} such values that only one subproblem p 
has yj^p equal to 1. We provide the full proof in the 
supplementary material. 

Theorem 1 allows us to construct the coordinate- 
ascent algorithm (fig. 2). This result cannot be ex¬ 
tended to the cases of general pairwise or high-order 
potentials. We are not aware of analytical updates 
like (30) for these cases. The update w.r.t. single vari¬ 
able can be computed numerically but the resulting 
algorithm would be very slow. 


5.3 Getting a primal solution 

Obtaining a primal solution given some value of the 
dual variables is an important practical issue related 
to the Lagrangian relaxation framework. There are 
two different tasks: obtaining the feasible primal point 
of the relaxation and obtaining the final discrete solu¬ 
tions. The first task was studied by Savchynskyy et al. 
[39] and Zach et al. [48] for algorithms based on the 
smoothing of the dual function. Recently, Savchyn¬ 
skyy and Schmidt [38] proposed a general approach 
that can be applied to SMR whenever a subgradient 
algorithm is used to maximize the dual. The second 
aspect is usually resolved using some heuristics. Os- 
okin and Vetrov [34] propose an heuristic method to 
obtain the discrete primal solution within the SMR 
approach. Their method is based on the idea of block- 
coordinate descent w.r.t. the primal variables and is 
similar to the analogous heuristics in TRW-S [25]. 




6 Theoretical analysis 

In this section we explore some properties of the 
maximum points of the dual frmctions and express 
their maximum values explicitly in the form of LPs. 
The latter allows us to reason about tightness of 
the obtained lower bounds and to provide analytical 
comparisons against other approaches. 

6.1 Definitions and theoreticai properties 

Denote the possible optimal labelings of binary vari¬ 
able Uip associated with node i and label p given 
Lagrange multiplyers A with 

ZzpW = {z I 3y^ e ArgminL(y, A) : = z} 

yG{50) 

where L(y, A) is the Lagrangian constructed within 
the SMR approach (13). 

Definition 1. Point A satisfies the strong agreement 
condition if for all nodes i, for one label p set Zip{\) 
equals {1} and for the other labels it equals {0}; 

VzeV 3\p€V : Z,pi\) = {1}, Wq^p Z,g(X) = {0}. 

This definition implies that for a strong agreement 
point A there exists the rmique binary labeling y 
consistent with all sets Zip{\) {yip G Zip{\)) and 
with consistency constraints (51). Consequently, such 
y defines labeling x £ V'^ that delivers the global 
minimum of the initial energy ( 1 ). 

Definition 2. Point A satisfies the weak agreement 
condition if for all nodes i gV both statements hold true: 

1) 3p G T’ '■ 1 G Zip{X). 

2) Vp G iP : Z,p{X) = {1} ^ V9 ^ p, 0 G Z,,(A). 

This definition means that for a weak agreement 
point A there exists a binary labeling y consistent 
with all sets Zip{X) and consistency constraints (51). 
It is easy to see that the strong agreement condition 
implies the weak agreement condition. 

Denote a maximum point of dual function D{X) (15) 
by A* and an assignment of the primal variables 
that minimize Lagrangian L{y,X*) by y*. We prove 
the following property of A* in the supplementary 
material. 

Theorem 2. A maximum point X* of the dual func¬ 
tion D{X) satisfies the weak agreement condition. 

In general the three situations are possible at the 
global maximum of the dual function of SMR (see 
fig. 3 for illustrations): 

1) The strong agreement holds, binary labeling y* 
satisfying consistency constraints (51) (thus la¬ 
beling a;* as well) can be easily reconstructed, 
there is no integrality gap. 

2) Some (maybe all) sets Zip{X*) consist of multiple 
elements, x* consistent with (51) is non-trivial to 
reconstruct, but there is still no integrality gap. 


D(A) D(X) D(\) 


(1) ^ (2) ^ (3) ^ 

Fig. 3. The three possible cases at the maximum of the 
SMR dual function. The optimal energy value is shown 
by the horizontal dotted line. Solid lines show faces of 
the hypograph of the dual function D{X). 

3) Some sets Zip{X*) consist of multiple elements, 
labeling y* that is both consistent with (51) and 
is the minimum of the Lagrangian does not exist, 
the integrality gap is non-zero. 

In situations 1 and 2 there exists a binary label¬ 
ing y* that simultaneously satisfies consistency con¬ 
straints (51) and delivers the minimum of the La¬ 
grangian L{y,X*). Labeling y* defines the horizontal 
hyperplane that contains point {X*, D{X*)). 

In situation 1 labeling y* is easy to find because the 
subgradient defined by (16) equals zero. Situation 2 
can be trickier but in certain cases might be resolved 
exactly. Some examples are given in [35]. 

In practice the convex optimization methods de¬ 
scribed in sec. 5.1 t 5 q)ically converge to the point 
that satisfies the strong agreement condition rather 
quickly when there is no integrality gap. In case of 
the non-zero integrality gap all the methods start to 
oscillate around the optimum and do not provide a 
point satisfying even the weak agreement condition. 
The coordinate-wise algorithm (fig. 2) often converges 
to points with low values of the dual function (there 
exist multiple coordinate-wise maximums of the dual 
function) but allows to obtain a point that satisfies the 
weak agreement condition. This result is summarized 
in theorem 3 proven in the supplementary material. 

Theorem 3. Coordinate ascent algorithm (fig. 2) when 
applied to a pairwise energy with associative pairwise 
potentials returns a point that satisfies the weak agreement 
condition. 

6.2 Tightness of the lower bounds 

6.2.1 Pairwise associative MRFs 

Osokrn et al. [35] show that in the case of pairwise 
associative MRFs maximizing Lagrangian dual (15) is 
equivalent to the standard LP relaxation of pairwise 
energy (2), thus the SMR lower bound equals the ones 
of competitive methods such as [30], [17], [39] and in 
general is tighter than the lower bounds of TRW-S [25] 
and MPLP [12] (TRW-S and MPLP can get stuck at 
bad coordinate-wise maximums). In addition Osokrn 
et al. [35] show that in the case of pairwise associative 
MRFs the standard LP relaxation is equivalent to 
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Kleinberg-Tardos (KT) relaxation of a uniform metric 
labeling problem [20]. In this section we reformulate 
the most relevant result for fhe sake of completeness. 
The proof will be given lafer as a corollary of our more 
general result for high-order MRFs (theorems 5 and 
6 ). 

Theorem 4. In case of pairwise MRFs with associative 
pairwise potentials (2), (12) the maximum of Lagrangian 
dual (15) equals the minimum value of the standard LP 
relaxation ( 6 ), (7), ( 8 ) of the discrete energy minimization 
problem. 

6.2.2 High-order MRFs 

In this section we state the general result, compare 
it to known LP relaxations of high-order terms and 
further elaborate on one important special case. 

The following fheorem explicifly defines fhe LP 
relaxafion of energy (1) fhat is solved by SMR. 

Theorem 5. In case of high-order MRF with sparse 
pattern-based potentials (17) the maximum of the SMR 
dual function D{X) (19) equals the minimum value of the 
following linear program: 


min 

y 


E ^c,dyc,ci (31) 

CcCdG-Dc 

s.t. yip,yc,d e [0,1], Vf gV, 'ipe'P, VCgC, Md^Vc, 

(32) 

yc,d < yide, VC G C, \/d G Vc, Vf G C, (33) 
Ev*p = i> (34) 

pev 

Proof: We start the proof with the two lemmas. 

Lemma 1. Let yi,... ,yK be real numbers belonging to 
the segment [0,1], K > 1. Consider the following linear 
program: 

K 

min z{K — 1) — Zk (35) 

z,zi,...,ZKe[o,i] 

k—1 

s.t. Zk <z,Zk< yk- (36) 

Point z = zi = ■ • ■ = Zk = rninfe^i yk delivers the 
minimum of LP (75)-(76). 

Lemma 1 is proven in the supplementary material. 
Lemma 2. Consider the convex optimization problem 

(37) 

s.t. Ay = b, (38) 


min f{y), 
yey 


Lemma 2 is a direct corollary of fhe strong dual¬ 
ity theorem with linearity constraint qualification [3, 
Th. 6.4.2]. 

Let us finish the proof of theorem 5. All values of 
fhe potenfials are non-positive 6c,d < 0 so lemma 1 
implies that the LP (31)-(34) is equivalent to the 
following LP: 

™ “E E ^Cd((|C'l - l)zc.d-E4.d) (39) 

C&CdC'Dc ^ l(SC ^ 

S.t. zS.d G [0>1]. VCgC, ydGVc, Vice, (40) 

y^p,zc,d G [0,1], VzGV, VpGP, VCgC, 'idGVc, 

zc,d ^ ^c,d, zc,d yidijC gC, 'idG'Dc, VfGC, 

(41) 

Y,y^p = ^. VfGV. (42) 

pCV 

Denote the target function (39) by Q{y, z). Let 

^ i£V 

Task (39), (40)-(41) is equivalent to the standard 
LP-relaxation of funefion (18) of binary variables y 
and 2 ; (see lemma 3 in the supplementary mate¬ 
rial). As 9c,d < 0/ function (18) is submodular and 
thus its LP-relaxation is tight [27, Th. 3], which 
means that i?(A) = D{X) for an arbifrary A. Func¬ 
tion f{y) = min 2 g( 40 )_( 4 i)( 5 (y, 2 :), set y defined 
by yip G [0,1], matrix A and vector b defined by (42) 
satisfy the conditions of lemma 2 which implies that 
the value of solution of LP (39)-(42) equals i?(A*), 
where A* = argmax;,^ i?(A). The proof is finished by 
equality R{X*) = D{X*). □ 

Corollary 1. For the case of pairwise energy (2) with 
bij,pq < 0 the SMR returns the lower bound equal to the 
solution of the following LP-relaxation: 


mm 

y 


where T" C R” is a convex set, A is a real-valued matrix 
of size n X m, b G M™, / : K" —>■ K is a convex function. 
Let set y contain a non-empty interior and the solution of 
problem (37)-(38) be finite. 

Then the value of the solution of (37)-(38) equals 
max min L{y, A) where L{y, A) = f{y) + X^{Ay — b). 


in EE ^ipUip H” E E ^ij^pqUij^pq (^3) 

ieVpeV p,g&'P 

s.t. yip G [0,1], Vf G V, Vp G V, 

yij,pq G [0,1], V{f, j} G £, Vp,g G V, 

yij,pq ^ J/ip, Vij.pq ^ bjqj V{f, j} G 6 ^, Vp, G 7^, 

(44) 

Epep,Vip = l’ (45) 

Permuted 7^"-Potts. Komodakis and Paragios [29] 
formulate a generally tighter relaxation than (31)-(34) 
by adding constraints responsible for marginalization 
of high-order potentials w.r.t all but one variable: 

E Vc.p = yip,, VC G C, Vpo G P, Vf G C. (46) 

pcVZ-. pt=po 

In this paragraph we show that in certain cases the 
relaxation of [29] is equivalent to (31)-(34), i.e. con¬ 
straints (46) are redundant. 

Definition 3. Potential 9c is called permuted V'^-Potts if 
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VC G C Vd', d" £Vc yieC: d' ^d" d''. 

In permuted 7^"-Potts potentials all preferable con¬ 
figurations differ from each ofher in all variables. The 
7^"-Potts pofential described by Kohli ef al. [22] is a 
special case. 

Theorem 6. If all higher-order potentials are permuted 
V^-Potts, then the maximum of dual function (19) is equal 
to the value of the solution of LP (31)-(34), (46). 

Proof: First, all values 9c,d are non-positive so 
variables yc,p take maximal possible values. This 
implies that equality in (46) can be substituted with 
inequality. Second, for a permuted P”-Potts potential 
9c for each £ € C and for each po G V there 
exists no more than one labeling d G Vc such that 
di — po which means that all the sums in (46) can 
be substituted by the single summands. These two 
observations prove that (46) is equivalent to (33). □ 

The LP result for fhe associative pairwise MRFs 
(fheorem 4) immediately follows from theorem 6 and 
corollary 1 because all associative pairwise poten¬ 
tials (12) are permuted 7^"-Potts and condition (46) 
reduces to marginalization constraints (8). 

6.2.3 Global linear constraints 

Theorem 7. The maximum of the dual 
function D{X, tt) (26) (constructed for the minimization 
of pairwise energy (2) under global linear constraints (24) 
and (25)) under constraints Wk > 0 is equal to the value 
of the solution of LP relaxation (67)-(69) with addition of 
the global linear constraints (24) and (25). 

The proof is similar to the proof of fheorem 5 but 
requires including extra terms in the Lagrangian. The¬ 
orem 7 generalizes to the case of high-order pofentials. 

7 Experimental evaluation 

7.1 Pairwise associative MRFs 

In fhis section we evaluate the performance of differ- 
enf optimization mefhods discussed in sec. 5.1 and 
compare the SMR approach against the dual decom¬ 
position on trees used in [30], [17] (DD TRW) and 
TRW-S method [25]. 

Dataset. To perform our tests we use the MRF in¬ 
stances created by Alahari et al. [1].^ We work with 
instances of fwo t 5 rpes: stereo (four MRFs) and seman¬ 
tic image segmenfation (five MRFs). All of fhem are 
pairwise MRFs wifh 4-connectivity system and Potts 
pairwise potentials. The instances of these two groups 
are of significanfly different sizes and thus we report 
the results separately. 

Parameter choice. For each group and for each op¬ 
timization mefhod we use grid search fo find the 
optimal method parameters. As a criterion we use the 

7. http://www.di.ens.fr/~alahari/data/pamilOdata.tgz 


TABLE 1 

The optimal parameter values chosen by the grid 
search in sec. 7.1. 



Method 

Segmentation 

stereo 


subgradient 

7 = 0.3 

7 = 0.1 

H 

Q 

bundle 

7=0.01, mx,=0.05, 

U'max = 1000 

7=0.01, mx,=0.05, 

U’max=2000 

Q 

aggr 

bundle 

7=0.02, t«max = 100, 

mr=0.0005 

7=0.01, 771^=0.002, 
'^^max—500 


subgradient 

7 = 0.7 

7 = 0.3 

SMR 

bundle 

7=0.1, mL=0.2, 

—100 

7=0.01, mL=0.1, 

'^^max —100 


aggr 

bundle 

7=0.02, 771^=0.001, 

—500 

7=0.02, 771t^=0.001, 

t(^max = 1000 


maximum value of lower bound obtained within 25 
seconds for segmenfation and 200 seconds for sfereo. 
For fhe bundle mefhod we observe fhaf larger bundles 
lead to better results, but increase the computation 
time. Value bs = 100 is a compromise for both stereo 
and segmentation datasets. The same reasoning gives 
us bs = 10 for LMBM and hr = 10 for L-BFGS. 
Paramefer Wmin for (aggregafed) bundle mefhod did 
not affect the performance so we used 10~^° as in [17]. 
The remaining paramefers are presenfed in table 1. 
Implementation details. For L-BFGS and LMBM we 
use the original authors' implementations. All the 
other optimization routines are implemented in MAT- 
LAB. Both SMR and DD TRW oracles are imple¬ 
mented in C++: for SMR we used the d 5 mamic version 
of BK max-flow algorifhm [7], [21]; for DD TRW 
we used fhe domestic implementation of dynamic 
programming speeded up specifically for Potfs model. 
To esfimafe the primal solution at the current point 
we use 5 iterations of ICM (Iferafed Conditional 
Modes) [4] ever 5 rwhere. We normalize all the energies 
in such a way that the energy of the o-expansion 
result corresponds to 100, the sum of minimums of 
unary pofentials - to 0. All our codes are available 
online.® 

Results. Fig. 4 shows the plots of fhe averaged 
lower bounds vs. time for stereo and segmentation 
instances. See fig. 7 in the supplementary material 
for more detailed plots. We observe that SMR outper¬ 
forms DD TRW for almost all optimization methods 
and for the segmentation instances it outperforms 
TRW-S as well. When o-expansion was nof able fo 
find fhe global minimum of fhe energy (1 segmenfa- 
fion instance and all stereo instances) all the tested 
methods obtained primal solutions with lower ener¬ 
gies than the energies obtained by u-expansion. 

7.2 High-order potentials 

In this section we evaluate the performance of fhe 
SMR approach on the high-order robust P”-Potts 

8. https://github.com/aosokin/submodular-relaxation 
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A SMR subgradient 
—■— SMR aggr. bundle 
A DD TRW subgradient 
—■— DD TRW aggr. bundle 


-•-SMR bundle 
—♦—SMR LMBM 
-•-DD TRW bundle 
—♦— DD TRW LMBM 


-▼-SMR L-BFGS 

-▼- DD TRW L-BFGS 
TRW-S 



Fig. 4. The results of experiment 7.1 comparing SMR, 
DD TRW (both with different optimization routines) and 
TRW-S on segmentation and stereo datasets. 


model proposed by Kohli et al. [23] for the task of 
semantic image segmentation. 

Dataset. We use the two models both constructed 
for the 21-class MSRC dataset [41]. The segmentation 
of an image is constructed via the minimization of 
the energy consisting of unary, pairwise and high- 
order potentials. The parameters of the unary and 
the pairwise potentials are obtained on the standard 
training set via the piecewise training procedure. The 
unary and pairwise potentials are based on the Tex- 
tonBoost model [41] and are computed by the ALE 
library [32].® The unary potentials are learned via 
a boosting algorithm applied to a number of local 
features. The pairwise potentials are contrast-sensitive 
Potts based on the 8-neighborhood grid. The high- 
order potentials are robust 7^”-Potts (22) and are based 
on the superpixels (image regions) computed by the 
mean shift segmentation algorithm [9]. 

The two models differ in the number of segments 
used to construct the high-order potentials. In the 
first one we use segments produced by one mean- 
shift segmentation with the bandwidth parameters 
set to 7 for the spatial domain and 9.5 for the LUV- 
colour domain. In the second model we use the three 
segmentations computed with the following pairs of 
parameters: (7,6.5), (7,9.5), (7,14.5) - the first number 
corresponds to the spatial domain, the second one - 
to the colour domain. In both models for potential 9c 
based on a set of nodes C C V parameters Q and 
7 are selected as recommended by Kohli et al. [23]: 
Q = 0.1|C7|, 7 = |C'|O-®(0.2 -i 0.5G(C')) where G(G) = 
exp(-12||i^,gc(/i-M)^ll/|C'l)/M = (Ei6c/*)/|C'l and 

fi is the color vector in the RGB format: fi S [0,1]? 
We refer to the two models as 1-seg and 3-seg models. 
Baseline. As a baseline we choose a "generic opti¬ 
mizer" by Komodakis and Paragios [29]. We imple¬ 
ment it as a dual decomposition method with the 
following decomposition of the energy: all pairwise 
potentials are split into vertical, horizontal, and diag¬ 
onal (2 directions) chains, each high-order potential 
forms a separate subproblem, unary potentials are 

9. http://www.inf.ethz.ch/personal/ladickyl/ALE.zip 


-SMR dual —•— SMR primal -CWD dual —•— CWD primal 



Oracle calls 

(a): 1 set of superpixels 



Fig. 5. Energies/lower bounds produced by SMR and 
CWD on the image segmentation models. All blue 
curves correspond to CWD, red curves - to SMR. Plots 
(a) and (b) correspond to the 1-seg and 3-seg models 
respectively. For each point of each plot we report the 
results aggregated across the standard test set: the 
medians (solid lines), the lower and the upper quartiles 
(dashed lines). This means that for each plot 50% of all 
curves pass between the dashed lines. 


evenly distributed between the horizontal and the 
vertical forests. We refer to this method as CWD. 

Within the CWD method the dual function depends 
on |V||7^|(3 -I- h) variables. Here h is the average 
number of high-order potentials incident to individual 
nodes. For the 1-seg model h equals 1 and for the 
3-seg model h = 3. The dual function in the SMR 
method depends on |V| variables. In our experiments 
the CWD's duals have 5725440 and 8588160 variables 
for the 1-seg and 3-seg models respectively, the SMR's 
duals have 68160 variables in both models. We used 
the L-BFGS [33] method with step size tolerance of 
10“^ and maximum number of iterations of 300 as 
the optimization routine. 

Results. Fig. 5 reports the results of the CWD and 
SMR methods applied to both segmentation models 
on the test set of the MSRC dataset (256 images). 
For each number of oracles calls we report the lower 
bounds and the energies. Lower bounds are the values 
of the dual functions D{X) obtained at the correspond¬ 
ing oracle calls. Energies are computed for the label¬ 
ings obtained by assigning the first possible labels to 
the nodes where subproblems disagree. To aggregate 
the results we subtract from each energy the constant 
such that the best found lower bound equals zero. 

We observe that SMR converges faster than CWD 
in terms of the value of the dual and obtains better 
primal solutions. Note, that the more sophisticated 
version of CWD, PatB [29], could potentially improve 
the performance only when the high-order potentials 
intersect. For the 1-seg model high-order potentials do 
not intersect so PatB is equivalent to CWD. 
Comparison against a-expansion. We report the com¬ 
parison of the SMR method against the original a- 
expansion-based algorithm of Kohli et al. [23]. We run 
SMR (L-BFGS optimization) and a-expansion meth¬ 
ods on the energies constructed for the test set of 
the MSRC dataset (256 instances) as described in the 
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TABLE 2 

Comparison of the energies obtained by the SMR and 
the a-expansion [23] aigorithms on the two modeis 
with robust 'P”-Potts potentiais. 



1 segmentation 

3 segmentations 

percentile 

Q;-exp. 

SMR 

Q-exp. 

SMR 

25% 

2.34 

0 

2.40 

0 

50% 

9.11 

0.04 

9.22 

0.44 

75% 

31.60 

5.22 

25.46 

4.92 


previous section. The SMR approach provides the 
certificate of optimality for 194 (out of 256) instances 
for the 1-seg model and for 197 (out of 256) instances 
for the 3-seg model. The a-expansion method finds 
the energy equal to the lower bound obtained by SMR 
in 145 and 147 cases for the two models respectively. 
We exclude energies where both methods found the 
global minimum from further analysis. We subtract 
from each energy such a constant that the best found 
lower bound equals zero and compare the obtained 
energies in table 2. For each method and each model 
we report the median (50%), lower (25%) and upper 
(75%) quartiles of obtained energies. 

The median SMR convergence time up to accuracy 
of 10“^ for the 1-seg model is 23.4 seconds for the 
cases with no integrality gap and 136.3 seconds for the 
cases with the gap. For the 3-seg model the analogous 
numbers are 24.0 seconds and 164.3 seconds. The 
median running time for the a-expansion for the two 
models was 2.6 and 6.3 seconds. 


7.3 Global linear constraints 

In this section we experiment with the ability of SMR 
to take into account global linear constraints on the 
indicator variables. We compare the performance of 
SMR against the available baselines. See Osokin et al. 
[35] for the evaluation of SMR on interactive image 
segmentation and magnetogram segmentation tasks. 
Dataset. We use a synthetic dataset similar to the 
one used by Kolmogorov [25]. We generate 10-label 
problems with 50 x 50 4-neighborhood grid graphs. 
The unary potentials are generated i.i.d.: 6ip Af{0, 1). 
The pairwise potentials are Potts with weights gen¬ 
erated i.i.d. as an absolute value of A/’(0,0.5). As 
global constraints we've chosen strict class size con¬ 
straints (23). Class sizes are deliberately set to be 
significantly different from the global minimum of 
the initial unconstrained energy. In particular, for class 
p = 1,..., 10 we set its size to be |V|p/X]J=i Q- 
Baselines. First, we can include the global linear 
constraints in the double-loop scheme where the in¬ 
ner loop corresponds to the TRW-like algorithm that 
solves the standard LP-relaxation and the outer loop 
solves the Lagrangian relaxation of the global con¬ 


straints: 

M K 

max LBtrw(0) - ^ ^ 7rfcd^ (47) 

m=l k=l 

M K 

S.t. e^p = 6jp imW'^p TTkVjp, Vj e V, Vp G v, 

m—1 k—1 

^ij,pq — j} ^ ^i '^Pi Q ^ i 

TTfc > 0, yk = 1,..., K. 

Here LBtrw {d) is the lower bound (we use TRW- 
S [25]) on the global minimum of the energy with 
potentials defined by 9. We will refer to this algorithm 
as GTRW. 

Another baseline is based on the MPF framework 
of Woodford et al. [45]. Let LBtrw{S{h)) be the 
solution of the standard LP-relaxation of the energy 
with modified unary potentials: djpifJ-) = djp — lJ,jp, 
9ij^pq{f^') — Oij^pq. Define 

^G(M) = min (/x,y), (48) 

y 

s-t- X! y^p = % ^ 

iev 

0 < Vjp < 1, Vj G V, Vp G V. 

Task (48) is a well-studied transportation problem (can 
be solved when Y.p^v % = 1^1)- 

Function LBtrw{0{b)) -f ^g(A‘) is concave and 
piecewise linear as a function of /r. In this experiment 
we compute LBtrw{S{p)) with TRW-S and - 

with the simplex method. We refer to this method as 
MPF. 

Results. Fig. 6 shows the convergence of SMR, 
DD TRW, and MPF averaged over 50 randomly gener¬ 
ated problems. For MPF we do not consider the time 
required to solve the transportation problem 'LcifJ') 
since we do not have the efficient implementation for 
the simplex method. For all the methods we report 
the lower bounds, the energies and the total constraint 
violation of the obtained primal solutions.^® Based on 
fig. 6 we conclude that all three methods converge to 
the same point but the SMR lower bound converges 
faster and hence a primal solution with both low 
energy and small constraint violation can be found. 

8 Conclusion 

In the paper we propose a novel submodular relax¬ 
ation framework for the task of MRF energy min¬ 
imization. In comparison to the existing algorithms 
based on the decomposition of the problem's h 5 rper- 
graph to h 5 q)eredges and trees, our approach requires 
fewer dual variables and therefore converges much 

10. The choice of primal solution is performed by the following 
heuristics: in SMR all pixels with conflicting labels in the same 
connected component were assigned to the same randomly selected 
label from the set of conflicting labels; in GTRW and MPF primal 
solution was chosen as the labeling generated by TRW-S. 
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- SMR dual -SMR primal - SMR constraint violation 

- GTRW dual - GTRW primal - GTRW constraint violation 

- MPF dual -MPF primal - MPF constraint violation 



Time (seconds) Time (seconds) 

(a): Lower bounds and energies (b): constraint violation 


Fig. 6. Comparison of SMR, GTRW and MPF meth¬ 
ods. Piot (a) shows the iower bounds and the energies 
obtained by each method. Note that the energies can 
be iower than the iower bounds because the iabeiings 
vioiate the giobai hard constraints. Piot (b) shows the 
totai constraint vioiation of the primai soiution (mea¬ 
sured in the percentage of pixeis that have to be 
recoiored to make the constraints consistent). For each 
curve we report the median, the iower and the upper 
quartiies. 


faster especially when the number of labels is rela¬ 
tively small. State-of-art convex optimization methods 
together with the dynamic max-flow algorithm make 
the optimization of the SMR dual efficient so that SMR 
can be potentially applied to larger problems with 
sparse high-order potentials. We study the theoretical 
properties of the dual solution provided by SMR 
under various conditions and show the equivalence 
of SMR lower bound to the specific LP relaxation of 
the initial discrete problem. 
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Appendix A 

Additional plots for sec. 7.1 

Fig. 7 provides additional plots for the experiment 
described in sec. 7.1 of the main paper. Fig. 7a- 
b show zoomed versions of lower-bound plots for 
image segmenfation and stereo instances (fig. 4a-b 
of the main paper). Fig. 7c-d show values of fhe 
energy (primal) achieved af each iteration. Fig. 7e-f 
are zoomed versions of fig. 7c-d. 


—A— SMR subgradient —•— SMR bundle ▼ SMR L-BFGS 

—■— SMR aggr. bundle — SMR LMBM 

— DD TRW subgradient DD TRW bundle DD TRW L-BFGS 

— DD TRW aggr. bundle — DD TRW LMBM -a— TRW-S 





Fig. 7. Extra plots for the experiment in sec. 7.1. 


Appendix B 

Non-submodular Lagrangian 

B.1 The family of lower bounds 

Consider the task of minimizing the pairwise energy 
formulated in the indicator notation: 

min EE ^ipUip E E E (^ij,pqyipVjq^ (49) 

i&V p^V p^q&V 

S.t. 2/ip e {0,1}, VfeV,pGT’, (50) 

^ y^p = 1, w G V, (51) 

P^V 

Sec. 4.1 of the main paper considers the case when 
all values Oij^pq of the potentials are non-positive. 
In this section we elaborate on the case when it is 
not true. To apply the SMR method (see sec. 4.2) 
in this case we have to subtract from each pairwise 
pofential a consfant equal to the maximum of its 
values: rnaxp.^ Oij^pq. This procedure can badly af¬ 
fect the sparsity of the initial potentials, e.g. if the 
potential specifically prohibifs certain configurations. 
If we drop the subtraction step we retain sparsity 
but expression (49) is no longer guaranteed to be 
submodular, thus minimizing it over the Boolean cube 
is hard. Instead of solving this minimization exactly 
we can solve its standard LP relaxation using QPBO 
method [5], [26] and further perform Lagrangian re- 
laxafion of the consistency constraint (51). We refer to 
this approach as nonsubmodular relaxation (NSMR) and 
describe it more formally. 
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Consider the following Lagrangian: 


L{y, A) — OipUip + ^^Pij,pqyipyjq 

iGV iGV PjQ^'P 

+E^*(Ey*p-i) (52) 

iev Ve?" ^ 


It is a pairwise pseudo-Boolean function w.r.t. vari¬ 
ables y so we can apply the QPBO algorithm. The 
algorithm provides us with a lower bound Dqpbo{^) 
on the solution of (49)-(51) and a partial labeling of 
y, i.e. each yi is assigned a value from set {0,1,0}. 
The QPBO lower bound is known to be equal to the 
bound provided by the standard LP relaxation [5, th. 6 
and 7] and thus function DqpboW concave and 
piecewise-linear w.r.t. A. 

To compute a subgradient of DqpboW we use 
the fact that the standard LP relaxation of a pair¬ 
wise energy minimization problem is half-integral [5, 
prop. 20]. Specifically, if QPBO assigns 0, 1, or 0 to yip 
the corresponding unary variable of the LP-relaxation 
(yip) equals 0,1, or 0.5 respectively. Pairwise variables 
yij,pq can be set by the following rule (see lemma 3): 


yij,pq 


niin(i/*p,yj,), 9 ij,pq < 0, 

max(0, yip + yjq — 1), otherwise. 


(53) 


We explore tightness of the NSMR lower bound in 
section B.2 and compare it experimentally against the 
SMR with the subtraction trick in sec. C. 


B.2 Tightness of the NSMR bound 

Theorem 8. The lower bound given by NSMR (sec. B.l) 
method applied to a pairwise energy equals to the minimum 


value of the following linear program: 

min EE Gipyip E E 9ij,pqyij,pq, (54) 

^ iGV pGV {i,j}C:S p.qCiV 

s.t. yip,yij,pq > [0;1]) (55) 

yio,pq <yip, yij,pq <yjq, Vjf, j }e £,^p,q€V, (56) 
yij,p^yip~byjp l, V{i,j}sf,Vp,(57) 
^ y^p = 1, Vi € V. (58) 

Proof: We start the proof with the lemma that 


formulates the standard LP relaxation of a pairwise 
energy in a convenient form: 

Lemma 3. Consider the problem of minimizing energy 
E{y) = binary vari¬ 

ables y = {yi}ieA where A and B are sets of unary and 
pairwise terms correspondingly. The standard LP relax¬ 
ation of the problem can be formulated as follows: 

min '^aiyi+ ^ bijpij, (59) 

^ *6.4 {i,j}GB 

S.t. yi,yij>0, (60) 

Vi] < yt, Vzj < Vj, Vji, j} e B, (61) 
Vij >yi + yj - 1, V{i, j} e B. (62) 


Proof At first, recall how the standard LP relax¬ 
ation looks like in this case: 

min ^ ^ (liZii -|- E ^ij ■) (63) 

i&A {i,j}€B 

s.t. Zii^ Zio^ ZijBl, ZijAl, ZijAO, ^ij ,00 ^ (^4) 

Zil -T Zio = 1, (65) 

Zij.lO Zijpi = 2,1, '^^1,01 T -2,1,00 ~ -2,0, 

2,1,01 4“ 2,j,ii — Zji^ 2,1,00 4“ 2ii,io — ^I'O- (66) 

Given the solution of LP (63)-(66) we can construct 
the solution of LP (59)-(62) with the same value of 
the target function by setting yi = zn, yij = ZijBi- 
The feasibility of this assignment is implied by non¬ 
negativity of 2 and constraints (66). 

On the other hand given the solution of (59)-(62) we 
can set 2 ii — yj, ZjQ — 1 2 ii, 2 ii,ii — yijf 2 , 1,01 — 2 ii 

2,j,ll/ 2ii,io — yn 2ii,ii, 2,1,00 — 2,1,11 -f 1 yn yjif 

which implies the feasibility by construction. □ 

Now we complete the proof of theorem 8. QPBO 
lower bound Dqpbo{^) is known to be equal to 
the minimum of the standard LP relaxation [5, th. 6 
and 7] and therefore can be expressed in form (59)-(62) 
that corresponds to LP (54)-(57). Applying Lagrangian 
relaxation to consistency constraints (58) (lemma 2) 
adds them to the LP. □ 

Appendix C 

Comparison of theorem 8 and corol¬ 
lary 1 

SMR can be applied to problem (49)-(51) in case 
when some Oij^pq > 0 using the subtraction trick 
described in sec. 4.2. In this section we compare 
the LP obtained this way with the one provided by 
theorem 8. The LP given by corollary 1 together with 
the the subtraction trick looks as follows: 

rmn f E!(%.P 9 ~ b)ii)yij,pq + 9M , 

i^Vp^V 

(67) 

s.t. j/,p e [0,1], Vi e V, Vp G V, 

y,i,P5 G [0,1], V{z, j} G E, Vp, qCV, 
yij.pqfSyip, yij,pq"Cyjq, V{i, j} G£^, Vp, <7 g7^, (68) 
Ec-dV*p = 1’ Vi G V (69) 

where Of = maxp^^ Oij^pq. The optimal point of LP (67)- 
(69) is feasible for LP (54)-(58) and vice versa. Note, 
that at the same time the polyhedra defined by the 
two sets of constraints are not equivalent. 

We provide the experimental comparison of the 
integrality gap in problem (54)-(58) against the inte¬ 
grality gap of (67)-(69) and the integrality gap given 
by the TRW-S [25] and DD TRW [30] algorithms. 

We generate an artificial data set similar to the one 
used in sec. 7.3 of the main paper. Specifically we 
generate 50 energies of size 20 x 20 with 5 labels each. 
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TABLE 3 

Comparison of integrality gaps provided by NSMR, 
SMR with the subtraction trick and TRW-S. We report 
the results aggregated over 50 energies. For the three 
different percentiles (the median and the two 
quartiles) we report the integrality gaps. 


percentile 

NSMR 

Subtraction 

TRW-S 

DD TRW 

25% 

0 

332 

0 

0 

50% 

0.0012 

355 

0.0034 

0.0021 

75% 

0.1621 

381 

0.3188 

0.1155 


The unary potentials are generated i.i.d.: Oip Ar(0,1). 
The pairwise potentials are Potts with weights gener¬ 
ated i.i.d. from A/^(0,0.5). Note, that many of fhese 
potenfials are nof associative. 

We report the results in table 3. We can conclude 
that NSMR provides much tighter relaxation than the 
SMR with the subtraction trick and is comparable to 
the relaxation provided by TRW-S. 


Appendix D 
Proof of theorem 1 

Theorem 1. Rule + Aj, = Af i + 

jr ^^ 2 ) — applied to a pairwise energy 

with associative pairwise potentials delivers the maximum 
of function D{\) w.r.t. variable \j. 


Proof: We start with proving lemma 4. 


Lemma 4. Consider a pseudo-Boolean function T’ : B” —>• 
K. Denote the difference between the min-marginals of vari¬ 
able i and labels 0 and 1 with Si: Si = MMi^QF—MMi^iF. 
The minimum of the energy with one unary potential added 
can be computed explicitly: 


min iF(z) SzA 

zGB" 


MM.^oF, S>S^, 
MMi^iF + S, S<S,. 


Proof: Recall the definition of fhe min-marginals 
of F: MMifiF = mhiz-. zi=o F(z) and MMi^iF = 
min^.. zi=i F{z). Min-marginals of fhe extended func¬ 
tion Fz + Szi of variables Zi can be computed 
analitically: MMi Q{F + Szi) = min^. 

MMi^i{F + Szi) = min^: zi=i F{z) F S. Equality 


min Fz + Szi = min{MMi^Q[FFSzi), MMi^i{F + Szi)) 

finishes fhe proof. □ 

Lef us finish fhe proof of fheorem 1. We exploif 
fhe facf fhat in case of pairwise associative potentials 
the dual function can be represented as the following 
sum: 

D{X) = V min<Pp{yp, A) - V A* (70) 

Vv 

per iev 

where each summand <Pp{yp,\) is a submodular 
function w.r.t. variables yp. It suffices to show that 
L»(A”®“') > D{X^) where Af = Xf‘‘, i j, and 
X^- = Xf<^ -L 5, 5 € R. 


Consider case when S < Aj. We have S < where 
Sl^=MMj,o iVpW , A°'‘')-MM,-1 (y^o), A°'‘'). 

Lemma 4 and inequality S < ensure that 



(71) 


For all other summands of fhe dual (70) inequalifies 

cain^piyp, A”®"") > min^>p(yp, A"^), Vp pf'' (72) 

hold because Aj > S. 

After summing up (71) and (72) we use (70) and get 

^(A«e«,) > D{X^). 

Now consider case S > Aj. We have S > 5 ^ 2 ) where 
<i^2)=MM,-0 <f>p(2) (y^(2), A°'‘')-MM,-1 (y^(.), A°'‘'). 

Lemma 4 and inequality S > <1^2) ensure that 
miii^p(yp, A”®"") = min^>p(yp, A"^), \jp f. pf\ (73) 
and 



(74) 


After summing up (73) and (74) we use (70) and get 

L>(A”®“') > D{X^). □ 

Appendix E 
Proof of theorem 2 

Theorem 2. A maximum point X* of the dual func¬ 
tion D{X) satisfies the weak agreement condition. 

Proof: Let us assume the contrary: point A* is not 
a weak agreement point. This implies that for some 
node j G V either there exist labels p f q such that 
Zjp{X*) = Zjq{X*) = {1} or Zjp{X*) = {0} holds for 
all labels p. In the first case we can take A^ such that 
for all i j holds A^ = A* and Aj = A* -I- e where 
e > 0. If e is small enough {1} still belongs to both 
Zjp{X^) and Zjq{X^). Hence y* G ArgminT(y, A^) but 
D{X^) > D{X*) contradicting the assumption that A* 
is a maximum point point of D{X). The second case 
can be proven by considering Aj = A* — e. □ 

Appendix F 
Proof of theorem 3 

Theorem 3. Coordinate ascent algorithm (fig. 2) when 
applied to a pairwise energy with associative pairwise 
potentials returns a point that satisfies the weak agreement 
condition. 

Proof: When converged the algorithm returns a 
point A where S^ < 0 < p € V \ j G V. 

Consequently 1 G (i) and 0 G Zj^p, p CP \ {p^^^}, 
which means that the weak agreement is satisfied. □ 
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Appendix G 
Proof of lemma 1 

Lemma 1. Let yi,... ,yK be real numbers belonging to 
the segment [0,1], K > 1. Consider the folloiuing linear 
program: 

K 

min ziK — 1) — Zk (75) 

s.t. Zk < z, Zk< yk- (76) 

Point z = zi = ■ ■ ■ = Zk = min^^i. k Vk delivers the 

minimum of LP (75)-(76). 

Proof: Without restricting the generality let us 
assume that yi < • ■ • < yx- Denote the solution 
of (75)-(76), with z*,zf...,zf, and the value of 
solution - with /*. Variable z has a positive coefficient 
in the linear function (75) so z* = z^. 

Analogously z^ = min(yfc,z*). 

Let us show that z* > yi. Assume the opposite, 
i.e. £ = 2/1 — z* > 0. Point z* + e, zj" + £,..., zj- + e 
is feasible and at it function (75) equals /* — e, which 
contradicts the optimality of /*. 

Let us show that z* < 2 / 2 - Assume the opposite. 
There are two case possible: 

1) TJK < z*. Denote e = z* — 2 /if > 0. Point z* — 
£, zf...,zf is feasible and delivers the value 
of (75) equal to /* — £, which contradicts the 
optimality of /*. 

2) yi < z* < yi+i, for some i e {2,... ,K}. Denote 
e = z* — yi > 0. Point z* — e, zf , z|, z|_,_^ — 
e,... ,zf — s is feasible and delivers the value 
of (75) equal to /* — e{£ — 1), which contradicts 
the optimality of /*. 

All points of form z* G [ 2 / 1 , 2 / 2 ]/ = yi, Z 2 = • ■ • = 

zf = z* are feasible and deliver the optimal value /* 
to function (75). The point specified in the statement 
of the lemma is of this form. □ 
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